Abstract

We describe an algorithm for performing a joint scheduling/interconnect
synthesis optimization for System-on-Chip (SoC) architectures. The
algorithm is able to account for different distributions of long vs.
short interconnect routes in an architecture. It is based on a genetic
algorithm, and utilizes a graph isomorphism test to significantly pare
the search space and increase the search efficiency.


1.  Introduction 

Interconnect synthesis is important for today’s system-on-chip (SoC)
designs. As transistor density increases, more functional units can be
placed on a single chip, and the number of possible interconnections
(links) between them increases. The longest wires on the chip are
usually due to these links. These wires contribute to delay and limit
the maximum achievable clock rate. Also, routing these interconnections
is a significant challenge for the EDA tools. A number of today’s
architectures for SoC provide special routing tracks for long
interconnects. Future architectures may even incorporate optical
interconnects. In this paper, we develop methods for deriving efficient
interconnection networks in these architectures. The idea is that if we
can incorporate routing constraints in the high-level front-end design
stage, placement and routing can be improved in the back end of the
design process and performance will increase.

Embedded systems typically run a limited and fixed set of applications.
We can use this application-specific information to optimize the
interconnection network. For our purposes, an optimal network is defined
in the context of a set of applications and constraints. The constraints
may include the latency, throughput, and power consumption for the given
applications, along with cost and area constraints of the overall
system. A key distinguishing feature of our algorithm is that we perform
the application scheduling and interconnect synthesis jointly; most
existing interconnect synthesis algorithms assume a given schedule.

2.  Routing  Constraints 

Most FPGA designs use a hierarchy of interconnect segments of differing
lengths. In the Atmel AT40K, for example, each logic cell connects
directly to its nearest neighbors through fast dedicated local paths.
These cells are nested in a mesh of longer-range interconnect buses
spanning eight cells.

Several research groups have proposed pipelined FPGA architectures
[Q, ED] which provide for long interconnects through a large number of
registers. For example, the RaPiD architecture is targeted to
high-throughput applications like those found in DSP. The interconnect
structure consists of both long-track and short-track interconnects.
Short tracks are used to achieve local connectivity between functional
units, while long tracks traverse longer distances along the datapath.
It is shown in [ED] that the area-delay product in this architecture is
sensitive to the ratio of short tracks to long tracks.

In a system utilizing optical interconnects, cost and area constraints
dictate the total number of transmitters and receivers in the system
(i.e., the total number of optical links). Routing constraints from
local partitions to their associated VCSEL transmitters and detectors
dictate a maximum fanout for each local partition. An optimum
interconnect is then one that minimizes the number of links while
enabling the application to meet the power, latency, and throughput
constraints.

Our general model for a system-on-chip (SoC) is one in which the chip is
partitioned into regions that are connected with local interconnects,
and these local regions are then connected through longer global
interconnects. The applications consist of task graphs [O], where the
individual tasks must fit fully into a local region. The graph vertices
(tasks or nodes) in the acyclic task graphs represent computations while
the edges represent the communication of a packet of data from a source
task to a sink task. Previous work [ED] has addressed global/local
partitioning, scheduling onto arbitrary interconnect topologies, and a
preliminary interconnect synthesis algorithm. We express the relative
distribution of local interconnects to global interconnects as a fanout
constraint, which is simply the number of long (global) interconnects
available for each local region. The strength of the interconnect
synthesis algorithm described in this paper is its ability to handle
interconnect fanout constraints.


3.  Link  Synthesis  using  Genetic  Algorithm 



Our interconnect synthesis method utilizes a genetic algorithm (GA)
operating in conjunction with a list scheduling algorithm. The
scheduling algorithm is a dynamic level scheduling (DLS) algorithm [ID]
modified for arbitrary interconnection networks. The algorithm takes
into account constraints on the total number of links $l_{max}$ and a
maximum fanout for each processor or functional unit $f_{max}$, as
described earlier and motivated by area and cost constraints for the
system.

3.1.  Genetic  Algorithm  Overview 

We give a brief overview of genetic algorithms here in order to explain
our link synthesis algorithm. When a genetic algorithm is used to solve
an optimization problem, it is necessary to be able to represent a
single solution to the problem with a single data structure. This
representation is often called a chromosome or an individual. The
quality or fitness of a given solution is evaluated using an objective
function. Genetic algorithms are capable of both broad search
(exploration) and local search (exploitation) of a search space. They
are often preferred over gradient search methods because they are less
prone to becoming trapped in local minima and do not require a smooth
search space.

The genetic algorithm creates an initial population of candidate
solutions using an initialization operator. Often the initial population
is distributed randomly over the search space. The genetic algorithm
first selects individuals from the population and performs crossover and
mutation operations on these individuals. Traditional crossover
generates two children from two parents in a population. This is
depicted in Figure 1 for a chromosome whose representative data
structure is an array. A crossover point is chosen, shown by the dashed
vertical line in Figure 1, and the child chromosome is formed by the
elements from the first parent chromosome to the left of the crossover
point and the elements from the second parent to the right of the
crossover point. The mutation operator specifies a procedure for
changing (mutating) an individual. The specifics of the mutation depend
on the data structure used to represent an individual. A typical
mutation operator for an individual represented by a binary string flips
the bits in the string with a given probability (the mutation
probability).

One generation of a genetic algorithm consists of performing crossover
and mutation on individuals in the population. There are many
possibilities for evolving the population. A simple GA uses
non-overlapping populations: each generation creates an entirely new
population of individuals. A steady-state GA uses overlapping
populations, in which a fraction of the population is replaced in each
generation. In an incremental GA each generation consists of only one or
two children.

Figure 1: Crossover operator applied to array chromosome.
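The two generic operators described above can be sketched in a few lines. This is only an illustration of textbook one-point crossover and bit-flip mutation on list chromosomes; the function names and the explicit `point` and `rng` arguments are assumptions made for the example, not part of the paper.

```python
import random


def one_point_crossover(parent_a, parent_b, point):
    """Return two children by swapping the tails of the parents at `point`."""
    child_x = parent_a[:point] + parent_b[point:]
    child_y = parent_b[:point] + parent_a[point:]
    return child_x, child_y


def bit_flip_mutation(bits, p_mut, rng):
    """Flip each bit independently with probability `p_mut`."""
    return [1 - b if rng.random() < p_mut else b for b in bits]


if __name__ == "__main__":
    rng = random.Random(0)
    x, y = one_point_crossover([0, 0, 0, 0], [1, 1, 1, 1], point=1)
    print(x, y)
    print(bit_flip_mutation(x, p_mut=0.25, rng=rng))
```

Passing the random generator explicitly keeps runs reproducible, which is convenient when comparing GA variants.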

3.2.  Problem  representation 

In our algorithm, the individuals are bit vectors corresponding to a
given interconnect topology. The fitness function for a chromosome in
our interconnect synthesis algorithm is described by

$$\text{fitness} = M(1 + P_f + P_l) \tag{1}$$

where $M$ is the makespan (latency) calculated by the modified DLS
algorithm for the interconnect topology of the chromosome, $P_f$
(Equation 6) is a penalty based on violating the fanout constraint
$f_{max}$, and $P_l$ (Equation 7) is a penalty based on violating the
maximum link constraint $l_{max}$.

We assume that the links are directional; that is, each link has one
dedicated transmitter side and one dedicated receiver side. Optical
links, for example, consist of a laser on one side of the link and a
detector on the other side. This assumption does not limit our model,
however, because any bidirectional link can be represented as two
directional links. We define a link vector as a bit vector with one
entry for each possible interconnection between two processors. For a
system with $N$ processors, there are $N(N-1)$ entries in the link
vector. The link vector for a four-processor system would be denoted as

$$l = (l_{01}\, l_{02}\, l_{03}\, l_{10}\, l_{12}\, l_{13}\, l_{20}\, l_{21}\, l_{23}\, l_{30}\, l_{31}\, l_{32}) \tag{2}$$

where $l_{ij}$ equals one if there is a connection from processor $i$ to
processor $j$ and zero otherwise. We define $l_{ij} = 0$ if $i = j$. We
also write $l$ as

$$l = (l_0\, l_1 \ldots l_{N-1}) \tag{3}$$

where $l_k$ describes the (outgoing) connections for processor $k$. We
will refer to the $l_k$ as processor link vectors. We define the fanout
of processor $i$ by

$$f_i = \sum_{j=0}^{N-1} l_{ij} \tag{4}$$

Then the number of links is given as

$$n_l = \sum_{i=0}^{N-1} f_i \tag{5}$$

while the fanout penalty is given by

$$P_f = \sum_{i=0}^{N-1} p_i \tag{6}$$

where $p_i = \max(0, f_i - f_{max})$. The link penalty is given by

$$P_l = \max(0, n_l - l_{max}). \tag{7}$$
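To make the fitness, fanout, link-count, and penalty definitions above concrete, the following sketch evaluates them for a flat link vector. It assumes the makespan has already been computed by a scheduler and is simply passed in; the function names and the example vector are hypothetical.

```python
def processor_link_vectors(link_vector, n):
    """Split a flat link vector of n*(n-1) bits into n vectors of n-1 bits."""
    width = n - 1
    return [link_vector[k * width:(k + 1) * width] for k in range(n)]


def fitness(link_vector, n, makespan, f_max, l_max):
    """Makespan scaled by (1 + fanout penalty + link penalty)."""
    fanouts = [sum(v) for v in processor_link_vectors(link_vector, n)]
    n_links = sum(fanouts)                              # total number of links
    p_f = sum(max(0, f - f_max) for f in fanouts)       # fanout penalty
    p_l = max(0, n_links - l_max)                       # link penalty
    return makespan * (1 + p_f + p_l)


if __name__ == "__main__":
    # Hypothetical 4-processor topology; processor 0 has fanout 3.
    x = [1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0]
    print(fitness(x, n=4, makespan=10, f_max=2, l_max=5))  # 30
```

In the example, processor 0 exceeds the fanout limit by one and the topology exceeds the link limit by one, so the makespan of 10 is scaled by 1 + 1 + 1 = 3.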

3.3.  Fanout  Constraints 

In a real system, cost and area constraints will place a limit on the
processor fanout. Therefore, as mentioned above, it is important to have
a link synthesis algorithm that can conform to fanout constraints. Our
GA is able to incorporate these constraints in a straightforward manner
by implementing the initialization, crossover, and mutation operators as
described below.

3.4.  Crossover  and  Mutation  Operators 

We first note that if an individual topology is represented as a binary
string as in Equation 2, then the typical crossover operations like
array one-point crossover (Figure 1) or two-point crossover will not
preserve the fanout constraint. This is illustrated in Figure 2, where
both parents obey a fanout constraint $f_{max} = 2$, but processor 0 of
child X has fanout $f_{0X} = 3$. This is because the crossover point can
fall at any position in the string, including inside a processor link
vector. If we instead choose to represent the topology by the vector
representation of Equation 3, fanout constraints are preserved in the
crossover operation, since the processor link vectors $l_i$ are never
altered. The crossover operation only rearranges the relative position
of these processor link vectors. This is illustrated in Figure 3.

child X = (1 1 1 0 1 0 0 0 1 1 0 0)
child Y = (1 0 0 0 0 1 0 1 1 0 0 0)

Figure 2: Crossover operation for link synthesis using the binary string
representation of Equation 2. The link fanout constraint is not
preserved for child X, where the fanout of processor 0 is $f_{0X} = 3$.

We also must ensure that the initial population obeys the link
constraint. The initialization operator generates random processor link
vectors which each satisfy the fanout constraint $f_i \le f_{max}$, with
$f_i$ as defined in Equation 4. $N$ of these vectors, one per processor,
are then concatenated to form the link vector.

The mutation operator simply chooses a random bit in the link vector,
and sets its value to zero. This removes a link if one existed at this
point. Since the mutation operator only removes links, the fanout
constraint is preserved.
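The three fanout-preserving operators can be sketched as follows. This is one reading of the description above, not the authors' code: an individual is held as a list of per-processor link vectors so that crossover can only cut on a boundary between them. All names, and the choice to draw the number of links per processor uniformly, are assumptions.

```python
import random


def random_processor_vector(n, f_max, rng):
    """Random (n-1)-bit processor link vector with at most f_max ones."""
    width = n - 1
    ones = rng.randint(0, min(f_max, width))
    chosen = set(rng.sample(range(width), ones))
    return tuple(1 if j in chosen else 0 for j in range(width))


def random_individual(n, f_max, rng):
    """Initialization: one fanout-legal link vector per processor."""
    return [random_processor_vector(n, f_max, rng) for _ in range(n)]


def crossover(parent_a, parent_b, point):
    """Cut only between processor link vectors, so each vector stays intact."""
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])


def mutate(individual, rng):
    """Clear one randomly chosen bit, which can only remove a link."""
    n = len(individual)
    k, j = rng.randrange(n), rng.randrange(n - 1)
    vec = list(individual[k])
    vec[j] = 0
    return individual[:k] + [tuple(vec)] + individual[k + 1:]


def max_fanout(individual):
    """Largest number of outgoing links over all processors."""
    return max(sum(v) for v in individual)
```

Because no operator ever adds a one to a processor link vector, any population built from `random_individual` stays within the fanout limit in every generation.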

4.  Using  Graph  Isomorphism 

If we consider systems in which all the processors are identical
(homogeneous processor set), then we can pare the design space
significantly if we only consider isomorphically unique topology graphs.
Two graphs $G = (V, E)$ and $G' = (V', E')$ are isomorphic if we can
relabel the vertices of $G$ to be vertices of $G'$, maintaining the
corresponding edges in $G$ and $G'$. For example, the graphs in Figures
4(a) and 4(b) are isomorphic with the vertices relabelled as follows:
1 → a, 2 → b, 3 → c, and 4 → d.

Figure 3: Crossover operation for link synthesis using the vector
representation of Equation 3. The fanout constraint $f_{max} = 2$ is
preserved in the children.

Figure 4: Example of two isomorphic graphs.

Consider a topology graph $G$ with $e$ edges and $n$ nodes where each
node corresponds to a processor and each edge corresponds to a link
between two processors. The maximum number of edges in $G$ is
$e_{max} = n(n-1)$, corresponding to a fully connected graph (full
crossbar interconnect). If all links are bidirectional, the topology
graph is undirected and $e_{max} = n(n-1)/2$. We can represent the
graphs with either an adjacency list or adjacency matrix and label each
different representation. Then for a graph with $e$ edges the number of
different labellings is given by

$$n_g = \frac{e_{max}!}{e!\,(e_{max} - e)!} = \frac{(n(n-1))!}{e!\,(n(n-1) - e)!} \tag{8}$$

which increases exponentially with $n$. The maximum value of $n_g$
occurs at $e = e_{max}/2$. However, the number of isomorphically unique
graphs $n_{unique}$ is much less than $n_g$. For very small $n$, we can
enumerate the different possibilities to show this. Figure 5 depicts the
different isomorphic graphs for $n = 4$ processors and $e = 3$
bidirectional links. There are 20 different graph labellings, but we
observe that most are isomorphic; only 3 are isomorphically unique.

Figure 5: Isomorphically unique graphs containing $E$ edges for $n = 4$
processors ($E_{max} = 6$, all links bidirectional). Here we only
consider undirected graphs representing bidirectional links in order to
make the figure clearer.
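The counts quoted for $n = 4$ and $e = 3$ can be reproduced by brute force. The sketch below enumerates every labelled undirected graph with the given number of edges and groups them by a canonical form taken as the minimum over all vertex relabellings. This exhaustive approach is a stand-in that is only practical for very small $n$; it is not how nauty works, and the function names are illustrative.

```python
from itertools import combinations, permutations


def canonical_form(edges, n):
    """Smallest sorted edge tuple over all n! vertex relabellings."""
    best = None
    for perm in permutations(range(n)):
        relabelled = tuple(sorted(
            tuple(sorted((perm[u], perm[v]))) for u, v in edges
        ))
        if best is None or relabelled < best:
            best = relabelled
    return best


def count_labelled_and_unique(n, e):
    """Return (number of labelled graphs, number of isomorphism classes)."""
    all_edges = list(combinations(range(n), 2))  # n(n-1)/2 undirected edges
    labelled = list(combinations(all_edges, e))
    unique = {canonical_form(g, n) for g in labelled}
    return len(labelled), len(unique)


if __name__ == "__main__":
    print(count_labelled_and_unique(4, 3))  # (20, 3)
```

The three classes found for four nodes and three edges are the triangle with an isolated node, the three-edge path, and the star.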

For larger $n$, $n_g$ increases rapidly according to Equation 8. We
enumerated the possibilities and tested for isomorphism for $n = 5$ and
$n = 6$ using Brendan McKay’s nauty program [0], which is currently the
fastest published graph isomorphism testing program. The results are
shown in Figure 6. For $n = 6$ and $e = 12$ we observe that there is a
difference of three orders of magnitude between the number of graph
labellings $n_g$ and those that are unique ($n_{unique}$). Also, this
ratio increases with $n_g$.

We exploit this property to improve our genetic algorithm. As mentioned
earlier, each generation in the GA consists of a predetermined number of
individuals derived from the previous generation by crossover and
mutation operators. The initial population is generated randomly.
However, the results from Figure 6 imply that a large fraction of the
individuals in any randomly generated population will be isomorphically
equivalent, and that we would be wasting computation time by operating
on equivalent interconnection topologies. Instead, we employ an
efficient online graph isomorphism test when generating the population,
and only accept new individuals that are isomorphically unique. By
reducing the solution space by orders of magnitude, the GA can search it
more thoroughly in a given amount of computation time, and produce
better results.

Figure 6: A comparison of the number of possible graph labellings $n_g$
given by Equation 8 with the number of these graphs that are
isomorphically unique, plotted against the number of edges for (a)
$n = 5$ nodes and (b) $n = 6$ nodes.

Figure 7: GA output versus generation for the FFT3 application.
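A minimal sketch of such an online uniqueness filter is given below, assuming directed topologies stored as 0/1 adjacency matrices. The canonical key is computed by exhaustive relabelling purely for illustration, where the paper uses nauty; `random_topology`, `unique_population`, and the `max_tries` cap are hypothetical names and choices.

```python
import random
from itertools import permutations


def canonical_key(adj):
    """Brute-force canonical form of a directed graph (n x n 0/1 matrix)."""
    n = len(adj)
    return min(
        tuple(adj[perm[i]][perm[j]] for i in range(n) for j in range(n))
        for perm in permutations(range(n))
    )


def random_topology(n, f_max, rng):
    """Random directed topology in which every node has fanout <= f_max."""
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        targets = [j for j in range(n) if j != i]
        for j in rng.sample(targets, rng.randint(0, min(f_max, n - 1))):
            adj[i][j] = 1
    return adj


def unique_population(size, n, f_max, rng, max_tries=10000):
    """Accept a candidate only if no isomorphic topology was accepted before."""
    seen, population = set(), []
    for _ in range(max_tries):
        if len(population) == size:
            break
        candidate = random_topology(n, f_max, rng)
        key = canonical_key(candidate)
        if key not in seen:
            seen.add(key)
            population.append(candidate)
    return population
```

Keeping the set of canonical keys makes each acceptance test a single hash lookup once the key is known, so the cost of the filter is dominated by the canonical labelling itself.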

The graph isomorphism test is advantageous for the link synthesis
algorithm only if the isomorphism testing is efficient. The complexity
of the graph isomorphism problem is still an open problem: there exists
no known polynomial-time algorithm for graph isomorphism testing,
although the problem has also not been shown to be NP-complete. It is
thought that the problem falls in the area between P and NP-complete, if
such an area exists [HOO]. However, McKay’s nauty [ED] program has
proved very efficient in practice. Although its worst-case run time is
exponential [0], an empirical test of a large number of randomly
generated graphs produced run times of $1.2p^2$ ns on a 1 GHz Pentium
III machine, where $p$ is the number of nodes in the graph [ED] ($p$
equals the number of processors in our case). By comparison, the DLS
scheduling algorithm has complexity $O(v^3 p)$, where $v$ is the number
of nodes in the task graph [ED]. We modify the DLS scheduling algorithm
by adding a flexibility calculation at each scheduling step. The
complexity of the flexibility algorithm is $O(v(v + e)\, p \log p)$, so
the overall complexity of scheduling an arbitrary graph using the
modified DLS scheduling algorithm is

$$O(v^4 (v + e)\, p^2 \log p). \tag{9}$$

The number of tasks in the application will be much greater than the
number of processors in practice, so $v \gg p$ and $e \gg p$. For
randomly generated graphs, the nauty program is therefore much faster
than the modified DLS scheduling algorithm and we achieve significant
speedup by detecting and exploiting graph isomorphism.
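As a rough consistency check (our reading, not a derivation given in the text), the overall bound in Equation 9 matches the product of the DLS bound and the flexibility bound. The second line evaluates the quoted empirical nauty run time for an assumed, purely illustrative processor count of $p = 8$:

```latex
\begin{align*}
O(v^3 p) \cdot O\bigl(v (v + e)\, p \log p\bigr)
  &= O\bigl(v^4 (v + e)\, p^2 \log p\bigr), \\
1.2\, p^2\ \text{ns}\,\Big|_{p = 8}
  &= 1.2 \times 64\ \text{ns} = 76.8\ \text{ns}.
\end{align*}
```

The isomorphism test grows only with the number of processors, while the scheduling bound also grows with the fourth power of the task count, which is why the test is cheap relative to one schedule evaluation when $v \gg p$.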

4.1.  Experiments 

We evaluated our interconnect synthesis algorithm on several DSP
benchmark application graphs. Figure 7 shows the convergence of the GA
vs. generation number for an FFT application, with population size
$N = 100$. This application graph, representing the computation of the
fast Fourier transform, was taken from [BJ]. In this plot the y-axis
refers to the schedule makespan of the best interconnection topology
found. Figure 8 shows how the makespan improves with increasing
processor fanout. As explained in [CD], there is an increasing area
overhead associated with a greater number of long interconnects. In our
model, the number of long interconnects is equal to the number of
processors times the average fanout per processor.

Figure 8: GA output for different processor fanout constraints.

5.  Conclusion 

Interconnect synthesis is becoming an increasingly important problem for
designers of systems-on-chip as the designs become larger. We presented
a genetic algorithm for synthesizing efficient interconnection networks
for SoC. The algorithm works in conjunction with a list scheduling
algorithm to jointly optimize both the schedule and the interconnect
topology. The algorithm is able to account for different distributions
of local vs. global (long) interconnect routing tracks via a processor
fanout constraint. It uses graph isomorphism to significantly pare the
search space in order to search more efficiently.